Introduction

The American Community Survey (ACS) provides vital yearly information about the nation and its people. Through the ACS, we learn about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics. Public officials, planners, and entrepreneurs use this information to assess the past and plan the future. The data are used to plan hospitals and schools, support school lunch programs, improve emergency services, build bridges, and inform businesses looking to add jobs and expand into new markets. The ACS consists of 72 questions, split into population and household characteristics.

In this project I worked with the Housing dataset to learn about vulnerable groups, focusing mostly on renters.

Data Frame Summary

Housing

Dimensions: 7487361 x 14
Duplicates: 1327802
| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|---|
| 1 | State [numeric] | Mean (sd): 27.8 (15.9); min < med < max: 1 < 27 < 56; IQR (CV): 30 (0.6) | 51 distinct values | 7487361 (100%) | 0 (0%) |
| 2 | Units.structure [factor] | 1. others; 2. SingleFamily.Detached; 3. SingleFamily.Attached; 4. Manufactured.Housing | 442600 (6.6%); 4511282 (66.9%); 369354 (5.5%); 1423410 (21.1%) | 6746646 (90.11%) | 740715 (9.89%) |
| 3 | Tenure [factor] | 1. Owned.mtg; 2. Owned.free; 3. Rented; 4. Others | 2576950 (42.0%); 1707678 (27.9%); 1725632 (28.1%); 121964 (2.0%) | 6132224 (81.9%) | 1355137 (18.1%) |
| 4 | Food.Stamp [numeric] | Min: 1; Mean: 1.9; Max: 2 | 1: 775443 (11.3%); 2: 6097496 (88.7%) | 6872939 (91.79%) | 614422 (8.21%) |
| 5 | HH.income [numeric] | Mean (sd): 80345.2 (88145.5); min < med < max: -21500 < 57000 < 3209000; IQR (CV): 71600 (1.1) | 56465 distinct values | 6132224 (81.9%) | 1355137 (18.1%) |
| 6 | Grs.rent [numeric] | Mean (sd): 1071 (612); min < med < max: 4 < 940 < 5022; IQR (CV): 666 (0.6) | 4303 distinct values | 1725632 (23.05%) | 5761729 (76.95%) |
| 7 | Property.Value [numeric] | Mean (sd): 276504.7 (383572.9); min < med < max: 100 < 180000 < 6308000; IQR (CV): 220000 (1.4) | 2462 distinct values | 4347580 (58.07%) | 3139781 (41.93%) |
| 8 | Owner.costs [numeric] | Mean (sd): 1257.3 (1050.8); min < med < max: 0 < 981 < 13392; IQR (CV): 1151 (0.8) | 9834 distinct values | 4284628 (57.22%) | 3202733 (42.78%) |
| 9 | No.rooms [numeric] | Mean (sd): 6 (2.4); min < med < max: 1 < 6 < 30; IQR (CV): 3 (0.4) | 27 distinct values | 6746646 (90.11%) | 740715 (9.89%) |
| 10 | when.moved [factor] | 1. 12 months or less; 2. 13 to 23 months; 3. 2 to 4 years; 4. 5 to 9 years; 5. 10 to 19 years; 6. 20 years or more | 738449 (12.0%); 397311 (6.5%); 1008962 (16.4%); 1001457 (16.3%); 1406809 (22.9%); 1579202 (25.8%) | 6132190 (81.9%) | 1355171 (18.1%) |
| 11 | Householder [factor] | 1. others; 2. single.m; 3. single.f | 5792717 (94.5%); 168265 (2.7%); 171242 (2.8%) | 6132224 (81.9%) | 1355137 (18.1%) |
| 12 | No.children [numeric] | Mean (sd): 0.5 (1); min < med < max: 0 < 0 < 18; IQR (CV): 1 (1.9) | 19 distinct values | 6132224 (81.9%) | 1355137 (18.1%) |
| 13 | F.Kitchen [numeric] | Min: 0; Mean: 0; Max: 1 | 0: 6525705 (96.7%); 1: 220941 (3.3%) | 6746646 (90.11%) | 740715 (9.89%) |
| 14 | F.Plumbing [numeric] | Min: 0; Mean: 0; Max: 1 | 0: 6570331 (97.4%); 1: 176315 (2.6%) | 6746646 (90.11%) | 740715 (9.89%) |

Generated by summarytools 0.9.4 (R version 3.6.1)
2019-12-18

## Summary Table
The summary table above lists the variables used in this analysis together with their main descriptive statistics.

Boxplots are an excellent way to identify outliers and other data anomalies. The boxplot of household income shows many outliers. The boxplot of gross rent shows many points beyond the whiskers, which are potential outliers, and the distribution is not symmetric. Property value also has outliers and is skewed.

Methodology

Missing data can reduce the statistical power of a study and can produce biased estimates, leading to invalid conclusions. In this analysis, missing values are handled by listwise deletion because of its simplicity and its comparability across analyses. However, this method has disadvantages. As the summary table shows, about 77% of gross rent values are missing; the figure is about 42% for property value and about 18% for household income. Much of the gross rent gap is structural, since the number of valid gross rent values (1725632) equals the number of rented units. Removing rows with missing values therefore reduces statistical power. In addition, this method does not use all of the available information, and the estimates may be biased.
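The idea of listwise deletion can be sketched in a few lines. This is a Python illustration rather than the R code used for the report; the rows and the helper name `listwise_delete` are hypothetical, and only the column names come from the summary table.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"Tenure": "Rented", "HH.income": 42000, "Grs.rent": 950},
    {"Tenure": "Owned.mtg", "HH.income": 88000, "Grs.rent": None},
    {"Tenure": "Rented", "HH.income": None, "Grs.rent": 1200},
    {"Tenure": "Rented", "HH.income": 31000, "Grs.rent": 700},
]


def listwise_delete(records, columns):
    """Keep only records with no missing value in any of the given columns."""
    return [r for r in records if all(r.get(c) is not None for c in columns)]


complete = listwise_delete(rows, ["HH.income", "Grs.rent"])
share_dropped = 1 - len(complete) / len(rows)
print(len(complete), share_dropped)
```

Restricting the check to the columns a given analysis actually needs, as the `columns` argument does here, keeps more rows than dropping every row with any missing field.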

Outliers

Most parametric statistics, such as means, standard deviations, and correlations, and every statistic based on them, are highly sensitive to outliers. Because common statistical procedures such as linear regression and ANOVA rely on these statistics, outliers can seriously distort the analysis. The boxplot is the most convenient plot for showing outliers. As the earlier boxplots show, household income, gross rent, and property value all contain many outliers. In this document, outliers are first detected and then the rows containing them are removed. Removing outliers has a cost, because the remaining data may no longer reflect the true picture. On the other hand, it makes standard statistical tests applicable, which can give a better understanding of the data. The regression section below shows the impact of outliers on the results.
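The report does not state which detection rule was used. A minimal sketch, assuming the usual boxplot rule (points more than 1.5 times the IQR beyond the quartiles), is given here in Python rather than R. The rent values and the helper name `iqr_bounds` are hypothetical, and quartile conventions differ slightly between tools, so the bounds may not match an R boxplot exactly.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: quartiles widened by k times the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical monthly gross rents with one extreme value.
rents = [600, 750, 800, 900, 940, 1000, 1100, 1300, 5000]

low, high = iqr_bounds(rents)
kept = [r for r in rents if low <= r <= high]
print(low, high, kept)
```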

Association between Building Structures and Tenure

To check whether these two categorical variables are associated, a chi-squared test is used. The result gives sufficient evidence (p-value < 0.05) that tenure and type of structure are associated.

| statistic | p.value | parameter | method |
|---|---|---|---|
| 2546721 | 0 | 9 | Pearson's Chi-squared test |
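
The statistic and its degrees of freedom can be computed directly from a contingency table. The sketch below is in Python rather than R and uses hypothetical counts, not the report's data; the function name `chi_squared` is made up for illustration. With four structure types and four tenure types, the degrees of freedom are (4 - 1) x (4 - 1) = 9, which matches the `parameter` column above.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Small check case: every expected count is 20.
toy_stat, toy_dof = chi_squared([[30, 10], [10, 30]])

# Hypothetical 4 x 4 counts (rows: structure type, columns: tenure).
counts = [
    [120, 60, 200, 20],
    [900, 700, 150, 30],
    [80, 40, 90, 10],
    [150, 200, 160, 15],
]
stat, dof = chi_squared(counts)
print(toy_stat, toy_dof, dof)
```

A statistic above roughly 16.92, the 5% critical value for 9 degrees of freedom, leads to rejecting independence.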

The plot shows that most Americans prefer to buy a single-family house, whereas apartments and manufactured housing are popular among renters. Most householders live in a home they own, and owners with a mortgage clearly outnumber those without one. Finally, detached units are the most popular structure type among Americans.

## Gross Rent for single males and single females.

Here we want to know whether the mean gross rent for single females differs significantly from that for single males, which a t-test can answer. First we check the normality of the data. The plot shows that gross rent is not normally distributed. As noted at the beginning, the gross rent data contain many outliers, so normality was checked again after removing them.

The second histogram has less skewness and kurtosis, but it still does not look like a normal distribution.

Before running a t-test we need to know whether the two groups have equal variances. Because the data are not normal and the sample is large, Levene's test is used to check this. The p-value is very small (less than 0.05), so we conclude that the variances are not equal.

| term | df | statistic | p.value |
|---|---|---|---|
| group | 1 | 102.4981 | 0 |
| | 161366 | NA | NA |

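
Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group centre. The sketch below, in Python rather than R, assumes the median-centred variant (often called Brown-Forsythe); the report does not say which centre was used. The data and the function name `levene_statistic` are hypothetical.

```python
import statistics


def levene_statistic(groups, center=statistics.median):
    """F statistic and degrees of freedom of Levene's test for equal variances."""
    # Absolute deviation of every value from its own group's centre.
    devs = [[abs(x - center(g)) for x in g] for g in groups]
    n_total = sum(len(d) for d in devs)
    k = len(devs)
    grand_mean = sum(sum(d) for d in devs) / n_total
    group_means = [statistics.fmean(d) for d in devs]
    between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(devs, group_means))
    within = sum((z - m) ** 2 for d, m in zip(devs, group_means) for z in d)
    df1, df2 = k - 1, n_total - k
    return (between / df1) / (within / df2), df1, df2


# Hypothetical rents for two groups with different spread.
group_a = [1, 2, 3, 4, 5]
group_b = [1, 3, 5, 7, 9]
f_stat, df1, df2 = levene_statistic([group_a, group_b])
print(f_stat, df1, df2)
```

With two groups the first degree of freedom is 1, as in the table above, and the second is the number of observations minus 2.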

Finally we can run the t-test. Because the variances differ, Welch's two-sample t-test is used. The p-value is below 0.05, so we conclude that mean gross rent differs between the two groups.

| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| 25.46736 | 1096.294 | 1070.827 | 10.77862 | 0 | 153827.9 | 20.83639 | 30.09833 | Welch Two Sample t-test | two.sided |
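
For reference, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed as in the following sketch. It is written in Python rather than R, the samples are hypothetical, and `welch_t` is an illustrative name, not a library function.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof


# Hypothetical samples with unequal variances.
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(sample_a, sample_b)
print(t_stat, dof)
```

The reported table is internally consistent with this form: the estimate equals 1096.294 - 1070.827, and the confidence interval is approximately the estimate plus or minus 1.96 times estimate / statistic.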

Single Family Renters.

This bubble plot shows the relationship between gross rent and income for single males and single females in each state. At first glance there is a positive relationship between income and rent, so people with higher incomes tend to live in places with higher rents, perhaps with better conditions. The rent-to-income ratio looks similar across states, and the trend is the same for both groups. The plot also shows which states sit at the extremes or fall outside the general trend. For instance, average rent and average income in the District of Columbia are much higher than in other states. The biggest bubbles belong to California, New York, and Texas, so these states have many more single-householder renters than the others. Another interesting point is that the gap in average income between single males and single females is large in California, and the same holds in Texas.

Supplemental Nutrition Assistance Program Benefits

SNAP (the Supplemental Nutrition Assistance Program) helps low-income households obtain a more nutritious diet. Here the percentage of households receiving SNAP benefits is calculated for each state. In states such as Wyoming, Utah, Colorado, and Nebraska, a smaller percentage of households benefit from the program. At first glance, four white states in the center of the map stand out. White means that only 6 to 8 percent of households in these states receive SNAP, which could mean there are fewer low-income households there. It would be better, however, to compare this plot with the poverty rate of each state: if these states are in fact poor, the low share would suggest that SNAP is not working properly there.
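The per-state percentage can be sketched as follows, in Python rather than R. The records are hypothetical, `snap_share_by_state` is an illustrative name, and the sketch assumes that `Food.Stamp` is coded 1 for households that receive benefits and 2 for those that do not, which the report does not state. Survey weights are ignored, as in the rest of the analysis.

```python
from collections import Counter


def snap_share_by_state(records):
    """Percentage of households with Food.Stamp == 1 among valid records."""
    total, receiving = Counter(), Counter()
    for state, food_stamp in records:
        if food_stamp is None:  # skip missing responses
            continue
        total[state] += 1
        if food_stamp == 1:  # assumed coding: 1 = receives SNAP
            receiving[state] += 1
    return {state: 100 * receiving[state] / total[state] for state in total}


# Hypothetical (state, Food.Stamp) pairs.
records = [
    ("A", 1), ("A", 2), ("A", 2), ("A", 2),
    ("B", 1), ("B", 2), ("B", None),
]
shares = snap_share_by_state(records)
print(shares)
```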

Owner Cost

This plot compares owner costs across states. Owner costs in the District of Columbia, New Jersey, Connecticut, California, and Massachusetts are much higher than elsewhere. On the other hand, states such as West Virginia, Arkansas, Mississippi, and Oklahoma have the lowest owner costs.

Rent-to-Income Ratio

The plot shows that in North Dakota, Wyoming, South Dakota, and Iowa more than 55% of household income is spent on rent, which is a large share. By this measure, Hawaii, Florida, and California are the best states for tenants.
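The report does not say exactly how the ratio was computed. One possible sketch, in Python rather than R, annualises monthly gross rent, divides by annual household income, and takes the median ratio per state. The aggregation choice, the exclusion of non-positive incomes, the data, and the name `rent_to_income_by_state` are all assumptions made for illustration.

```python
import statistics
from collections import defaultdict


def rent_to_income_by_state(records):
    """Median share of annual income spent on rent, per state."""
    ratios = defaultdict(list)
    for state, monthly_rent, annual_income in records:
        if annual_income > 0:  # a ratio is undefined for zero or negative income
            ratios[state].append(12 * monthly_rent / annual_income)
    return {state: statistics.median(values) for state, values in ratios.items()}


# Hypothetical (state, Grs.rent, HH.income) records.
households = [
    ("A", 1000, 40000),
    ("A", 800, 48000),
    ("B", 1500, 36000),
    ("B", 900, 54000),
    ("B", 1200, 0),  # skipped: no positive income
]
ratio_by_state = rent_to_income_by_state(households)
print(ratio_by_state)
```

The summary table shows a minimum household income of -21500, so some rule for non-positive incomes is needed whichever definition is used.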

Relationship between Number of children and Rent Cost.

The plot shows the relation between these two variables.

As mentioned, this dataset has many outliers, so they were removed before fitting the regression.

Before fitting a regression model we should check the regression assumptions.

In the Residuals vs Fitted plot, the red line is essentially flat, which tells us there is no discernible non-linear trend in the residuals. The residuals also appear roughly equally variable across the range of fitted values, so this plot gives no clear indication of non-constant variance.

The second plot, Normal Q-Q, shows that the residuals are approximately normally distributed: they follow the straight line over most of the range but deviate from it in the lower tail because of outliers.

The Scale-Location plot shows whether the residuals are spread equally along the predictor range. In this case the red smooth line is not horizontal, and the spread around it varies with the fitted values, which suggests heteroskedasticity even though the Residuals vs Fitted plot gave no clear sign of it.

The last plot, Residuals vs Leverage, tells us which points have the greatest influence on the regression. No points fall outside the dotted lines (Cook's distance), so the plot does not show any influential cases.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.8928174 | 0.0229300 | 300.602791 | 0.0000000 |
| No.children | 0.0148514 | 0.0040068 | 3.706529 | 0.0002328 |
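
The intercept and slope of a one-predictor least-squares fit can be computed as in this sketch, written in Python rather than R. The data are hypothetical and `simple_ols` is an illustrative name; it does not reproduce the report's model or any transformation of the response the author may have applied.

```python
import statistics


def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: number of children and monthly rent.
children = [0, 1, 2, 3]
rent = [900, 950, 1000, 1050]

intercept, slope = simple_ols(children, rent)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(children, rent)]
print(intercept, slope)
```

In the table above, each t value is the estimate divided by its standard error, for example 0.0148514 / 0.0040068 for `No.children`.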

Discussion

The survey shows that tenants in some states face financial pressure: in some states rent takes up about half of household income, and short-term tenancies are common there. In addition, programs such as SNAP appear not to be widely used in these states. Together these two factors can cause economic problems for the people who live there. However, a better analysis would need to consider more factors.

Incorporating the survey weights (and the inflation adjustment) is essential for correct and meaningful results, but because of the complexity I ignored these factors.

I checked the relationship between the number of children and rent cost. Before the analysis I expected no relationship between the two variables, but the regression shows a statistically significant positive one (p ≈ 0.0002), although the estimated coefficient is small (about 0.015). On the other hand, because outliers and missing values were removed, the model may not be as reliable as it appears.